[AURON #2366] fix: Handle Paimon metadata columns in V2 native scan#2367
Conversation
There was a problem hiding this comment.
@lyne7-sc, thanks for the fix! The overall approach is sound: materialize __paimon_file_path/__paimon_bucket as per-file constants via partitionSchema, and fall back to Spark for unsupported metadata columns. The functional Test Paimon 1.2 CI job (which runs the new integration tests) is green.
|
@SteNicholas Thanks for the careful review! Addressed the comments in the latest update, and the relevant ci is green now. |
|
@lyne7-sc, could you provide your wechat which I could discuss with you? |
There was a problem hiding this comment.
@lyne7-sc, thanks for updates. The approach — reusing the partition-constant mechanism rather than reading these from Parquet/ORC — is clean. I found two correctness issues (verified against the decompiled Paimon 1.2.0 sources) plus a couple of value-fidelity edge cases and test-coverage gaps; inline comments below. Recommend addressing the two confirmed bugs (__paimon_file_path encoding, and the partition-key/metadata name collision) before merge.
Minor (not blocking): in toPartitionValueTemplate, SQLConf.get.resolver and indexByName = partitionKeys.zipWithIndex.toMap are split-invariant but rebuilt per split; and partitionKeys() is fetched twice (a Set at L131 and a Seq at L175). Worth hoisting into computePlan. Also consider whether file_path could be materialized on the executor (as NativeIcebergTableScanExec.metadataPartitionValues does) instead of baked into a per-file InternalRow on the driver.
SteNicholas
left a comment
There was a problem hiding this comment.
LGTM. The native Paimon V2 scan correctly handles __paimon_file_path and __paimon_bucket, falls back for unsupported metadata columns, and properly distinguishes physical/partition columns that collide with metadata names. The test coverage (multi-file splits, non-zero buckets, name collisions, partitioned tables, special-character partition values) is thorough.
One thing I verified closely: the __paimon_file_path value is built with new Path(rawFilePath).toUri.toString. This matches Paimon 1.2.0, whose PaimonRecordReaderIterator materializes the column via filePath().toUri().toString() (the percent-escaped form) — confirmed against the paimon-spark-3.x:1.2.0 artifact. So results agree with vanilla Paimon, including the '50%' partition case. (Note for the future: newer Paimon switched this to Path.toString()/unescaped, so if Auron ever bumps the Paimon dependency this rendering will need to follow.)
A few optional, non-blocking nits:
PaimonScanSupport.scala— intoPartitionValueTemplate, theelse if (isFilePathMetadataColumn(field.name)) nullbranch returns the samenullas the trailingelse(dead) and, unlikeisPartitionValueField/filePathMetadataIndex, omits the!isPhysicalColumnguard. Safe today only because a physical column reachespartitionSchemasolely as a real partition key; consider dropping the branch or adding the guard for consistency.metadataFilePath = new Path(...).toUri.toStringis computed for every data file even when no__paimon_file_pathcolumn is projected (filePathMetadataIndex < 0) and then discarded — could be guarded behindfilePathMetadataIndex >= 0.containsName/isFilePathMetadataColumn/isBucketMetadataColumnre-fetchSQLConf.get.resolverper call inside the per-field/per-file loop, thoughcomputePlanalready bindsresolver.- In
isPaimonMetadataColumn, thecontainsName(PaimonMetadataColumns, name)clause is subsumed by thestartsWith("__paimon_")prefix check, so thePaimonMetadataColumnsset is redundant. dataFile.externalPath().orElse(s"...")eagerly builds the fallback string even whenexternalPath()is present;orElseGet(...)would defer it.
None of these block the change.
Which issue does this PR close?
Closes #2366
Rationale for this change
Paimon metadata columns are produced by the Paimon scan layer rather than stored as physical columns in data files. The Paimon V2 native scan was passing these columns to the native Parquet/ORC reader as file columns, which can return incorrect values.
For example:
The native path returned null for
__paimon_file_path, while Spark/Paimon's scan path returns the actual file path.What changes are included in this PR?
Are there any user-facing changes?
No API changes. This is a correctness fix for Paimon V2 native scan.
How was this patch tested?
Adds Paimon V2 integration tests